Data Analysis in Happy Moments corpus

Introduction

Over decades, linguists have shown that language form and language in context could be different in people with different background. Today, I conduct data analysis on HappyDB to explore the variation of written text. HappyDB is a corpus of 100,000 crowd-sourced happy moments which people shared in the past three months or past 24 hours (https://rit-public.github.io/HappyDB/ ).

In this study, I would like to investigate how the happy moments change among people in different countries with different age and different marital status.

Length and Topics are the main keys in a sentence, therefore I use exploratory data analysis (boxplot, line plot, and scatterplot), word clouds, bigrams and topic modeling to analyze on the two aspects.

Note: Text Processing are based on Professor Ying Liu, Arpita Shah and Tian Zheng’s tutorial

Text Processing : process the raw textual data for our data analysis.

Step 0 - Load all the required libraries

library(tm)
library(tidytext)
library(tidyverse)
library(DT)
library(scales)
library(wordcloud2)
library(gridExtra)
library(ngram)
library(shiny) 

Step 1 - Load the data to be cleaned and processed

urlfile<-'https://raw.githubusercontent.com/rit-public/HappyDB/master/happydb/data/cleaned_hm.csv'
hm_data <- read_csv(urlfile)

Step 2 - Preliminary cleaning of text

We clean the text by converting all the letters to the lower case, and removing punctuation, numbers, empty words and extra white space.

corpus <- VCorpus(VectorSource(hm_data$cleaned_hm))%>%
  tm_map(content_transformer(tolower))%>%
  tm_map(removePunctuation)%>%
  tm_map(removeNumbers)%>%
  tm_map(removeWords, character(0))%>%
  tm_map(stripWhitespace)

Step 3 - Stemming words and converting tm object to tidy object

Stemming reduces a word to its word stem. We stem the words here and then convert the “tm” object to a “tidy” object for much faster processing.

stemmed <- tm_map(corpus, stemDocument) %>%
  tidy() %>%
  select(text)

Step 4 - Creating tidy format of the dictionary to be used for completing stems

We also need a dictionary to look up the words corresponding to the stems.

dict <- tidy(corpus) %>%
  select(text) %>%
  unnest_tokens(dictionary, text)

Step 5 - Removing stopwords that don’t hold any significant information for our data set

We remove stopwords provided by the “tidytext” package and also add custom stopwords in context of our data.

data("stop_words")

word <- c("happy","ago","yesterday","lot","today","months","month",
                 "happier","happiest","last","week","past")

stop_words <- stop_words %>%
  bind_rows(mutate(tibble(word), lexicon = "updated"))

Step 6 - Combining stems and dictionary into the same tibble

Here we combine the stems and the dictionary into the same “tidy” object.

completed <- stemmed %>%
  mutate(id = row_number()) %>%
  unnest_tokens(stems, text) %>%
  bind_cols(dict) %>%
  anti_join(stop_words, by = c("dictionary" = "word"))

Step 7 - Stem completion

Lastly, we complete the stems by picking the corresponding word with the highest frequency.

completed <- completed %>%
  group_by(stems) %>%
  count(dictionary) %>%
  mutate(word = dictionary[which.max(n)]) %>%
  ungroup() %>%
  select(stems, word) %>%
  distinct() %>%
  right_join(completed) %>%
  select(-stems)

Step 8 - Pasting stem completed individual words into their respective happy moments

We want our processed words to resemble the structure of the original happy moments. So we paste the words together to form happy moments.

completed <- completed %>%
  group_by(id) %>%
  summarise(text = str_c(word, collapse = " ")) %>%
  ungroup()

Step 9 - Keeping a track of the happy moments with their own ID

hm_data <- hm_data %>%
  mutate(id = row_number()) %>%
  inner_join(completed)

datatable(hm_data)

Exporting the processed text data into a CSV file

write_csv(hm_data, "../output/processed_moments.csv")

The final processed data is ready to be used for any kind of analysis.

Data Precrocessing

Step 1 - Load the processed text data along with demographic information on contributors

We use the processed data for our analysis and combine it with the demographic information available.

hm_data <- read_csv("../output/processed_moments.csv")

urlfile<-'https://raw.githubusercontent.com/rit-public/HappyDB/master/happydb/data/demographic.csv'
demo_data <- read_csv(urlfile)
#  Combine both the data sets and keep the required columns for analysis
#  We select a subset of the data that satisfies specific row conditions.
hm_data <- hm_data %>%
  inner_join(demo_data, by = "wid") %>%
  select(wid,
         original_hm,
         gender, 
         marital, 
         parenthood,
         reflection_period,
         age, 
         country, 
         ground_truth_category, 
         text) %>%
  mutate(count = sapply(hm_data$text, wordcount)) %>%
  filter(gender %in% c("m", "f")) %>%
  filter(marital %in% c("single", "married")) %>%
  filter(parenthood %in% c("n", "y")) %>%
  filter(reflection_period %in% c("24h", "3m")) %>%
  mutate(reflection_period = fct_recode(reflection_period, 
                                        months_3 = "3m", hours_24 = "24h"))

Exploratory Data Analysis

Countries

First thing trigered my interest is whether there exists some differences among countries since the writing culture are different in countries.

h1 <- hm_data%>%
  group_by(country)%>%
  summarize(l=length(country))%>%
  arrange(desc(l))

head(h1)
## # A tibble: 6 x 2
##   country     l
##   <chr>   <int>
## 1 USA     73475
## 2 IND     16629
## 3 VEN       546
## 4 CAN       531
## 5 GBR       355
## 6 PHL       276
h2 <- subset(hm_data, hm_data$country=="USA"|hm_data$country=="IND"|hm_data$country=="VEN"|
               hm_data$country=="CAN"|hm_data$country=="GBR"|hm_data$country=="PHL")

I select 6 countries with observations are more than 200 as larger observations will have more accurate results.

ggplot(h2, aes(x = marital, y = count, col = gender))+
  geom_boxplot()+facet_wrap(~country)+theme_bw()

For Canada, Great British, Philippines, and Venezuela, the distribution of the length of each sentence are approximately the same between married and single; female and maele (almost all within 50 words)

India and USA show some interesting patterns: For India, male used words to describe their happy moments longer than female when they are married. For USA, the count of words have similar pattern in marital and gender, but they tend to talk more compared with other countries.

Therefore, it seems worthwhile to focus on the two countries India and USA.

Now let’s see if age can show us some captivating findings

hm_data1 <- subset(hm_data, hm_data$country=="USA"|hm_data$country=="IND")
h3 <- subset(hm_data1, hm_data1$age != "prefer not to say")
h3$age <- as.numeric(h3$age)
h3 <- subset(h3, h3$age!="2" & h3$age!="3" & h3$age!="227" & h3$age!="233")

We remove age with 2, 3, 227, 233 and prefer not to say which could be typo and meaningless message. We suspect people have college education or above are willing to write down more words then people who do not have college degree.

We collect some education statistics from wikipedia shows the number of students in college over 60 years. (https://en.wikipedia.org/wiki/Higher_education_in_the_United_States)

year <- c(1950, 1970, 1990, 2009)
college.degree <- c(432000 + 58200, 827000+ 208000, 1052000 + 325000, 1600000 + 657000 )
degree.df <- data.frame(year,college.degree)
ggplot(degree.df, aes(x = year, y = college.degree)) + geom_line() + theme_bw()

The line plot shows an increasing trend in people getting educationsince 1950. In other words, more and more young people nowadays get better education than elder people.

h4 <- h3%>%
  group_by(age)%>%
  summarize(length.avg = mean(count))

h5 <- merge(h3, h4)
ggplot(h5, aes(x = age, y = count, col = marital)) + 
  geom_point() + theme_bw()

ggplot(h5, aes(x = age, y = length.avg, col = marital)) + 
  geom_line(position = position_dodge(width = 1)) + theme_bw()

According to the scatterplot we can see that there’s a slow trend of decreasing in words as the age grows older. Especially for people who are younger than 30, there are many of them included more than 100 words in one sentence which match with our guess.

On the other hand, the average length is about the same throughout the age but there’s an obvious different at age 84.

Word Cloud in different countries with different marital status

set.seed(101)

### Word Cloud with USA & married
bag_of_words <-  h3 %>%
  filter(country == "USA", marital == "married")%>%
  unnest_tokens(word, text)

word_count <- bag_of_words %>%
  count(word, sort = TRUE)

wordcloud2(word_count, size = 0.6, rotateRatio = 0)
### Word Cloud with USA & single
bag_of_words <-  h3 %>%
  filter(country == "USA", marital == "single")%>%
  unnest_tokens(word, text)

word_count <- bag_of_words %>%
  count(word, sort = TRUE)

wordcloud2(word_count, size = 0.6, rotateRatio = 0)
### Word Cloud with India & married
bag_of_words.ind <-  h3 %>%
  filter(country == "IND" , marital == "married")%>%
  unnest_tokens(word, text)

word_count.ind <- bag_of_words.ind %>%
  count(word, sort = TRUE)

wordcloud2(word_count.ind, size = 0.6, rotateRatio = 0)
### Word Cloud with India & single

bag_of_words.ind <-  h3 %>%
  filter(country == "IND" , marital == "single")%>%
  unnest_tokens(word, text)

word_count.ind <- bag_of_words.ind %>%
  count(word, sort = TRUE)

wordcloud2(word_count.ind, size = 0.6, rotateRatio = 0)

From the word clouds we can see, in USA, married people value Time, Daughter and Husband more compared with married people in India where Day, Time, Friend are more important. There is no much different between two countried in single group. Friend, time, day, moment are brought up more frequent.

Bigrams between Countries and Marital Status

Bigram for USA and India

#Create bigrams using the text data
hm_bigrams <- h3 %>%
  filter(count != 1) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

bigram_counts <- hm_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE)
hm_bigrams %>%
      count(country, bigram, sort = TRUE) %>%
      group_by(country) %>%
      top_n(10) %>%
      ungroup() %>%
      mutate(bigram = reorder(bigram, n)) %>%
      ggplot(aes(bigram, n, fill = country)) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~ country, ncol = 2, scales = "free") +
      coord_flip()
## Selecting by n

The bigram shows the differences in topics people focus on between India and USA. American put more focus on video game, ice cream, mother day whereas Indian focus more on moment life, day life, birthday party.

Bigram for Marital

hm_bigrams %>%
      count(marital, bigram, sort = TRUE) %>%
      group_by(marital) %>%
      top_n(10) %>%
      ungroup() %>%
      mutate(bigram = reorder(bigram, n)) %>%
      ggplot(aes(bigram, n, fill = marital)) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~ marital, ncol = 2, scales = "free") +
      coord_flip()
## Selecting by n

On the other hand, the most popular words are almost the same in married and single groups. They all value spend time, video games, moment life and ice cream.

Topic modeling

Text Mining

library(tm)
h.us.m <- subset(h3, h3$country=="USA" & h3$marital == "married")
h.us.s <- subset(h3, h3$country=="USA" & h3$marital == "single")
h.in.m <- subset(h3, h3$country=="IND" & h3$marital == "married")
h.in.s <- subset(h3, h3$country=="IND" & h3$marital == "single")

docs.usa.married <- Corpus(VectorSource(h.us.m$text))
docs.usa.single <- Corpus(VectorSource(h.us.s$text))
docs.ind.married <- Corpus(VectorSource(h.in.m$text))
docs.ind.single <- Corpus(VectorSource(h.in.s$text))
# Coverting into a document term matrix
dtm.usa.married <- DocumentTermMatrix(docs.usa.married)
dtm.usa.single <- DocumentTermMatrix(docs.usa.single)
dtm.ind.married <- DocumentTermMatrix(docs.ind.married)
dtm.ind.single <- DocumentTermMatrix(docs.ind.single)

# Find the sum of words in each Document
rowTotals.usa.married <- apply(dtm.usa.married, 1, sum)
rowTotals.usa.single <- apply(dtm.usa.single, 1, sum)
rowTotals.ind.married <- apply(dtm.ind.married, 1, sum)
rowTotals.ind.single <- apply(dtm.ind.single, 1, sum)
dtm.usa.married <- dtm.usa.married[rowTotals.usa.married > 0, ]
dtm.usa.single <- dtm.usa.single[rowTotals.usa.single > 0,]
dtm.ind.married <- dtm.ind.married[rowTotals.ind.married > 0,]
dtm.ind.single <- dtm.ind.single[rowTotals.ind.single > 0,]

LDA

Implement Latent Dirichlet Allocation to find the top 10 terms in each topics

library(topicmodels)
burnin <- 4000
iter <- 2000
thin <- 500
seed <-list(2003,5,63,100001,765)
nstart <- 5
best <- TRUE
#Number of topics
k <- 10
#Run LDA using Gibbs sampling
ldaOut.usa.married <-LDA(dtm.usa.married, k, method="Gibbs", control=list(nstart=nstart, 
                                                 seed = seed, best=best,
                                                 burnin = burnin, iter = iter, 
                                                 thin=thin))
ldaOut.usa.single <-LDA(dtm.usa.single, k, method="Gibbs", control=list(nstart=nstart, 
                                                 seed = seed, best=best,
                                                 burnin = burnin, iter = iter, 
                                                 thin=thin))
ldaOut.ind.married <-LDA(dtm.ind.married, k, method="Gibbs", control=list(nstart=nstart, 
                                                 seed = seed, best=best,
                                                 burnin = burnin, iter = iter, 
                                                 thin=thin))
ldaOut.ind.single <-LDA(dtm.ind.single, k, method="Gibbs", control=list(nstart=nstart, 
                                                 seed = seed, best=best,
                                                 burnin = burnin, iter = iter, 
                                                 thin=thin))
#docs to topics
ldaOut.topics.usa.married <- as.matrix(topics(ldaOut.usa.married))
ldaOut.topics.usa.single <- as.matrix(topics(ldaOut.usa.single))
ldaOut.topics.ind.married <- as.matrix(topics(ldaOut.ind.married))
ldaOut.topics.ind.single <- as.matrix(topics(ldaOut.ind.single))
#top 10 terms in each topic
ldaOut.terms.usa.married <- as.matrix(terms(ldaOut.usa.married,10))
ldaOut.terms.usa.single <- as.matrix(terms(ldaOut.usa.single,10))
ldaOut.terms.ind.married <- as.matrix(terms(ldaOut.ind.married,10))
ldaOut.terms.ind.single <- as.matrix(terms(ldaOut.ind.single,10))
ldaOut.terms.usa.married
##       Topic 1    Topic 2    Topic 3    Topic 4    Topic 5    Topic 6    
##  [1,] "found"    "wife"     "daughter" "watched"  "friend"   "job"      
##  [2,] "received" "dinner"   "son"      "played"   "family"   "happiness"
##  [3,] "bought"   "enjoyed"  "school"   "game"     "birthday" "life"     
##  [4,] "car"      "lunch"    "told"     "won"      "visit"    "live"     
##  [5,] "money"    "nice"     "excited"  "son"      "talked"   "moment"   
##  [6,] "gift"     "eat"      "event"    "fun"      "called"   "promotion"
##  [7,] "shopping" "favorite" "book"     "favorite" "surprise" "people"   
##  [8,] "store"    "night"    "love"     "movie"    "weekend"  "positive" 
##  [9,] "pay"      "cooked"   "read"     "laugh"    "vacation" "meet"     
## [10,] "card"     "food"     "learned"  "team"     "sister"   "person"   
##       Topic 7   Topic 8   Topic 9     Topic 10   
##  [1,] "husband" "time"    "finally"   "dog"      
##  [2,] "home"    "day"     "started"   "walk"     
##  [3,] "feel"    "morning" "finished"  "love"     
##  [4,] "house"   "night"   "baby"      "cat"      
##  [5,] "makes"   "kids"    "weeks"     "beautiful"
##  [6,] "cleaned" "hours"   "completed" "run"      
##  [7,] "didnt"   "spend"   "helped"    "park"     
##  [8,] "moved"   "mother"  "project"   "planted"  
##  [9,] "brought" "sleep"   "ive"       "garden"   
## [10,] "decided" "spent"   "couple"    "weather"
ldaOut.terms.usa.single
##       Topic 1     Topic 2    Topic 3    Topic 4   Topic 5      Topic 6    
##  [1,] "feel"      "watched"  "found"    "dog"     "dinner"     "finally"  
##  [2,] "live"      "played"   "bought"   "home"    "ate"        "finished" 
##  [3,] "life"      "game"     "car"      "hours"   "lunch"      "school"   
##  [4,] "people"    "favorite" "money"    "morning" "eat"        "event"    
##  [5,] "happiness" "won"      "store"    "love"    "food"       "completed"
##  [6,] "event"     "video"    "pay"      "night"   "favorite"   "class"    
##  [7,] "moved"     "movie"    "buy"      "cat"     "delicious"  "college"  
##  [8,] "person"    "fun"      "card"     "sleep"   "night"      "project"  
##  [9,] "moment"    "team"     "shopping" "run"     "restaurant" "hard"     
## [10,] "helped"    "tickets"  "dollars"  "cleaned" "cooked"     "learned"  
##       Topic 7      Topic 8    Topic 9     Topic 10   
##  [1,] "time"       "job"      "friend"    "day"      
##  [2,] "family"     "received" "talked"    "nice"     
##  [3,] "visit"      "ive"      "boyfriend" "walk"     
##  [4,] "weekend"    "weeks"    "birthday"  "mother"   
##  [5,] "mom"        "book"     "called"    "enjoyed"  
##  [6,] "sister"     "started"  "met"       "park"     
##  [7,] "girlfriend" "recently" "phone"     "weather"  
##  [8,] "spend"      "hit"      "havent"    "relax"    
##  [9,] "brother"    "gym"      "surprise"  "beautiful"
## [10,] "trip"       "lost"     "date"      "pretty"
ldaOut.terms.ind.married
##       Topic 1      Topic 2     Topic 3    Topic 4    Topic 5      
##  [1,] "friend"     "feel"      "day"      "family"   "bought"     
##  [2,] "birthday"   "makes"     "school"   "time"     "fun"        
##  [3,] "enjoyed"    "happiness" "surprise" "movie"    "office"     
##  [4,] "celebrated" "people"    "enjoyed"  "shopping" "purchased"  
##  [5,] "party"      "love"      "class"    "played"   "job"        
##  [6,] "marriage"   "found"     "birthday" "watched"  "colleagues" 
##  [7,] "brother"    "live"      "phone"    "kids"     "received"   
##  [8,] "met"        "started"   "gift"     "spend"    "discussions"
##  [9,] "meet"       "hours"     "finally"  "favorite" "buy"        
## [10,] "function"   "age"       "passed"   "game"     "team"       
##       Topic 6    Topic 7    Topic 8     Topic 9    Topic 10  
##  [1,] "son"      "helped"   "moment"    "daughter" "temple"  
##  [2,] "home"     "remember" "life"      "love"     "visit"   
##  [3,] "morning"  "people"   "time"      "event"    "car"     
##  [4,] "wife"     "walk"     "person"    "parents"  "trip"    
##  [5,] "mother"   "night"    "movement"  "baby"     "bike"    
##  [6,] "food"     "road"     "waiting"   "talked"   "enjoyed" 
##  [7,] "husband"  "told"     "results"   "sister"   "summer"  
##  [8,] "father"   "water"    "marks"     "called"   "travel"  
##  [9,] "sister"   "laugh"    "completed" "child"    "vacation"
## [10,] "prepared" "smile"    "change"    "wife"     "drive"
ldaOut.terms.ind.single
##       Topic 1         Topic 2        Topic 3    Topic 4    Topic 5   
##  [1,] "friend"        "people"       "home"     "moment"   "time"    
##  [2,] "enjoyed"       "job"          "finally"  "life"     "feel"    
##  [3,] "movie"         "completed"    "shopping" "school"   "called"  
##  [4,] "played"        "company"      "money"    "college"  "car"     
##  [5,] "watched"       "father"       "planned"  "exam"     "song"    
##  [6,] "meet"          "successfully" "started"  "results"  "live"    
##  [7,] "night"         "person"       "waiting"  "event"    "uncle"   
##  [8,] "experience"    "movement"     "hard"     "class"    "received"
##  [9,] "game"          "feel"         "suddenly" "passed"   "baby"    
## [10,] "unforgettable" "makes"        "told"     "remember" "blissful"
##       Topic 6     Topic 7     Topic 8    Topic 9      Topic 10   
##  [1,] "parents"   "happiness" "day"      "family"     "love"     
##  [2,] "met"       "day"       "surprise" "birthday"   "food"     
##  [3,] "office"    "feel"      "sister"   "celebrated" "girl"     
##  [4,] "trip"      "bike"      "mother"   "brother"    "visit"    
##  [5,] "mom"       "bought"    "dad"      "gift"       "beautiful"
##  [6,] "nice"      "god"       "phone"    "party"      "match"    
##  [7,] "share"     "talked"    "forget"   "home"       "favorite" 
##  [8,] "purchased" "memories"  "birth"    "marriage"   "morning"  
##  [9,] "smile"     "joy"       "excited"  "relatives"  "dinner"   
## [10,] "ready"     "photos"    "bought"   "house"      "spend"

Some words are used in all four sections, such as family, husband, life. Friends appears more in Indian section.While American have more interest in food, Indian shows more interest in party and song.

#write out results
write.csv(ldaOut.terms.usa.married,file=paste("../output/LDAGibbs",k,"TopicsToTermsUSAMarried.csv"))
write.csv(ldaOut.terms.usa.single,file=paste("../output/LDAGibbs",k,"TopicsToTermsUSASingle.csv"))
write.csv(ldaOut.terms.ind.married,file=paste("../output/LDAGibbs",k,"TopicsToTermsINDMarried.csv"))
write.csv(ldaOut.terms.ind.single,file=paste("../output/LDAGibbs",k,"TopicsToTermsINDSingle.csv"))

Conclusion

After analying the sentence length and topics in USA and India with different age and different marital status people, we come to the conclusion that age is an intriguing factor: young people tend to write down more words than elder people but on average senior write longer sentences than young people. Marital status and gender do not show much difference. Married American put more life focus on family yet Indian put more focus on enjoying life with friends.